
This week we learn about
Data visualization is omnipresent in science. Visualizations range from presenting raw data to illustrating analysis results or modeling outcomes. The way visualizations are constructed should, like any other part of the analysis, be reproducible and adhere to the basic principles of good scientific practice. You will practice reproducible data analysis skills while learning about best practices for graphs.
In the following sections we will have a look at different visualizations and things to be aware of when using them with the goal of transmitting information truthfully. The most important principles of good practice for visualizations are
We provide code in base R and ggplot2. A
short introduction to ggplot2 is provided at the end. A good
reference for both systems is https://bookdown.org/rdpeng/exdata/
library(tidyverse)
{: .language-r}
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.0 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.1 ✔ tibble 3.1.8
✔ lubridate 1.9.2 ✔ tidyr 1.3.0
✔ purrr 1.0.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
{: .output}
# See https://github.com/clauswilke/colorblindr for installation instructions
library(colorblindr)
{: .language-r}
Loading required package: colorspace
{: .output}
mintheme <- function(){
theme(legend.position = "none",
panel.grid = element_blank(),
axis.text = element_blank(),
panel.background = element_blank(),
axis.line = element_line(),
axis.ticks = element_blank())
}
medtheme <- function(){
theme(legend.position = "none",
panel.grid = element_blank(),
panel.background = element_blank(),
axis.line = element_line()
)
}
knitr::opts_chunk$set(echo = TRUE)
show_results <- FALSE
hide_results <- function(input, show=TRUE){
if(show){
return(input)
} else{
return("")
}
}
{: .language-r}
As a basic principle it is useful to consider the relationship between visual cues, i.e. the type of visual encoding of quantitative data such as bars or areas, and how accurately a viewer understands the resulting visualization. The graph on the right shows how accurately different types of representation are perceived. Lengths (in the form of bars) represent the data most accurately, while volumes convey only a generic impression and are harder to perceive accurately.
The linked picture is based on Graphical Perception: Theory, Experimentation, and Application to the Development of Graphical Methods by William S. Cleveland and Robert McGill.
Therefore, when creating a visualization you should choose the type of visual cue that best represents the data and transmits the intended message. Simple visualizations are generally easier to perceive correctly. We discuss some specific points in more detail below.
Providing simple and easily perceptible visualizations implies that you should avoid 3-dimensional graphical representations in most circumstances. Consider the following visualization:
As you can see (or not see!), some data is hidden behind the bars. Furthermore, it is rather difficult (and misleading) to compare heights at different depths. Further problems in this graph, unrelated to 3D, are the missing axis labels and the missing legend for the colors.
As a general principle we can conclude from the 3D example that you should always avoid occlusion of parts of the visualization. An example can be found in the following plot showing multiple densities in the same panel. The densities were colored according to group, but only the density in the front is fully visible.
set.seed(1234)
df <- data.frame(x=sample(1:10,1000, replace = TRUE),name=rep(letters[1:5],200))
df$x[df$name == "a" & df$x < 9] <-sample(1:6,sum(df$name == "a" & df$x < 9), replace = TRUE)
df$x[df$name == "b"] <- rep(1:10, 20)
ggplot(df) +
geom_density(aes(x,color=name, fill=name)) +
mintheme()
{: .language-r}
An alternative is to plot only the lines, which allows us to see all groups completely.
ggplot(df) +
geom_density(aes(x,color=name)) +
mintheme()
{: .language-r}
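Two further ways to avoid occlusion, sketched here under the assumption that the filled areas should be kept, are semi-transparent fills and one panel per group. The data frame `dens_df` is made up for this illustration and is not the data used above.

```r
library(ggplot2)

# Hypothetical example data: five groups with shifted normal distributions
set.seed(1)
dens_df <- data.frame(
  x    = rnorm(500, mean = rep(1:5, each = 100)),
  name = rep(letters[1:5], each = 100)
)

# Option 1: semi-transparent fills keep overlapping densities visible
p_alpha <- ggplot(dens_df) +
  geom_density(aes(x, color = name, fill = name), alpha = 0.3)

# Option 2: one panel per group, all panels sharing the same x-axis
p_facet <- ggplot(dens_df) +
  geom_density(aes(x, fill = name)) +
  facet_wrap(~name, ncol = 1)

p_alpha
p_facet
```

Transparency works well for a handful of groups; with many groups the separate panels tend to stay readable for longer.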
Pie charts can be considered an alternative to bar charts, although often not a good one since they use angles as visual cues. For instance look at the following three visualizations. First a barplot, second a stacked barplot and lastly a pie chart. Where are differences most visible?
set.seed(123)
x1 <- table(factor(c(rbinom(100,2,0.5),rep(0:2,100))))
x2 <- table(factor(c(rbinom(100,2,0.4),rep(0:2,100))))
x3 <- table(factor(c(rbinom(100,2,0.3),rep(0:2,100))))
df <- data.frame(x=c(x1,x2,x3), time = factor(rep(1:3,each=3)),var=c(names(x1),names(x2),names(x3)))
ggplot(df) +
geom_bar(aes(x=time,y=x, fill=var),position="dodge", stat="identity", width=1) +
mintheme()
ggplot(df) +
geom_bar(aes(x=time,y=x, fill=var),stat="identity", width=1) +
mintheme()
{: .language-r}
for (i in 1:3) {
print(ggplot(df %>% dplyr::filter(time==i)) +
geom_bar(aes(x=time,y=x, fill=var),stat="identity", width=1) +
coord_polar("y", start=0) +
mintheme() +
theme(text = element_blank()))
}
{: .language-r}
Another quasi-pie chart that is difficult to interpret, showing how
hard it is to see and quantify differences in a pie chart:
The arrangement of multiple plots and panels can also contribute to increasing the clarity of a visualization. Have a look at the following plot.
set.seed(123)
df <- data.frame(var2=c(rnorm(50),rnorm(50,3),rnorm(50,6)), var1=c(rnorm(150)), sample=rep(1:3, each=50))
plotls1 <- purrr::map(1:3, ~ ggplot(df[df$sample==.x,]) + geom_point(aes(var2,var1)) + facet_wrap(~sample) +
mintheme())
plotls2 <- purrr::map(3:1, ~ ggplot(df[df$sample==.x,]) + geom_boxplot(aes(var1)) + facet_wrap(~sample) +
mintheme() + labs(y=""))
ggpubr::ggarrange(nrow=2, ncol=1,
ggpubr::ggarrange(plotlist = plotls1, nrow = 1, ncol=3),
ggpubr::ggarrange(plotlist = plotls2, nrow = 1, ncol=3)
)
{: .language-r}
Two inconsistencies are present. First, the order of the samples
in the top row and the bottom row is not the same. Second, in the top
row var1 is on the y-axis while in the bottom row it is on
the x-axis. Staying consistent, and in general choosing an arrangement that
makes sense, helps to give a clear representation that transmits the
desired information efficiently. A better alternative for the above plot
is:
plotls1 <- purrr::map(1:3, ~ ggplot(df[df$sample==.x,]) + geom_point(aes(var1,var2)) + facet_wrap(~sample) +
mintheme())
plotls2 <- purrr::map(1:3, ~ ggplot(df[df$sample==.x,]) + geom_boxplot(aes(var1)) + facet_wrap(~sample) +
mintheme() + labs(y=""))
ggpubr::ggarrange(nrow=2, ncol=1,
ggpubr::ggarrange(plotlist = plotls1, nrow = 1, ncol=3),
ggpubr::ggarrange(plotlist = plotls2, nrow = 1, ncol=3)
)
{: .language-r}
Let’s have a look at the following masterpiece:
/static/3d_plot_exercise.jpg
Answer the following questions about the above plot:
1
Is the 3D representation sensible?
- Yes
- No
Solution
F Yes
T No{: .solution} {: .challenge}
2
Are the legend labels sensible?
- Yes
- No
Solution
F Yes
T No{: .solution} {: .challenge}
3
Are the axis labels sensible?
- Yes
- No
Solution
F Yes
T No{: .solution} {: .challenge}
4
Is it sensible to have X-axis tick values on multiple levels?
- Yes
- No
Solution
F Yes
T No{: .solution} {: .challenge}
5
Are the Y-axis tick values reasonable?
- Yes
- No
Solution
F Yes
T No{: .solution} {: .challenge}
6
Is the Y-axis range reasonable?
- Yes
- No
Solution
F Yes
T No{: .solution} {: .challenge}
7
Are the values in the plot easily readable?
- Yes
- No
Solution
F Yes
T No{: .solution} {: .challenge}
8
Is the title suitable?
- Yes
- No
Solution
F Yes
T No{: .solution} {: .challenge}
9
Is the color palette used color-blind friendly?
- Yes
- No
Solution
F Yes
T No{: .solution} {: .challenge}
Quiz 7.1
What aspect of the “Be simple, clear and to the point” input has been violated?
- 3D
- occlusion
- use of pie charts
- arrangement of multiple plots{: .challenge}
Solution
T 3D
T occlusion
F use of pie charts
F arrangement of multiple plots{: .solution}
Quiz 7.2
Is the data shown appropriately by the plot through
- the height of bars?
- the values on bars?
- additional values in white font?
- the tick marks indicating to which value each bar belongs?
- the raw data?{: .challenge}
Solution
- the height of bars?
- the values on bars?
- additional values in white font?
- the tick marks indicating to which value each bar belongs?
- the raw data?
{: .solution}
Quiz 7.3
What could be the reason that the Y-axis is shown starting at the value 55?
- the value 55 could be the smallest possible value in the context
- the differences between the bars are more pronounced than if the Y-axis starts at zero.
- the values below 55 need to be hidden{: .challenge}
Solution
T the value 55 could be the smallest possible value in the context
T the differences between the bars are more pronounced than if the Y-axis starts at zero.
T the values below 55 need to be hidden{: .solution}
Quiz 7.4
Thinking about the information regarding the axes, tick which of the following items are present in the plot
- Sensible X-axis tick label display
- Sensible X-axis label
- Sensible Y-axis tick values
- Sensible Y axis label{: .challenge}
Solution
- Sensible X-axis tick label display
- Sensible X-axis label
- Sensible Y-axis tick values
- Sensible Y axis label
{: .solution}
Quiz 7.5
Does the displayed grid help to determine the height of the color sections of the bars?
- Yes
- No{: .challenge}
Solution
F Yes
T No{: .solution}
Quiz 7.6
Is the color palette used color-blind friendly?
- Yes
- No{: .challenge}
Solution
F Yes
T No{: .solution}
Quiz 7.7
Which of the following additional information items does the plot feature?
- Informative title
- Informative legend labels
- Provenance of data
- Context of data{: .challenge}
Solution
- Informative title
- Informative legend labels
- Provenance of data
- Context of data
{: .solution}
Quiz 7.8
Does the course team think this is a good plot?
- Yes
- No{: .challenge}
Solution
F Yes
T No{: .solution}
set.seed(123)
x1 <- rnorm(40,2)
xnorm <- c(x1,-x1)
x2 <- c(runif(15, 0, 3.1), runif(5,3.1,4))
xunif <- c(x2,-x2)
x3 <- c(rep(0,4),rep(2.05,13),rep(-2.05,13),rep(3.8,5),rep(-3.8,5))
df <- data.frame(x1=xnorm,
x2=xunif,
x3=x3)
df_long <- df %>%
pivot_longer(cols = everything(),
names_to="dataset",
values_to="y")
{: .language-r}
Boxplots are used to give a rough overview of the distribution of a data set based on a few summary characteristics (quantiles). Consider the following three boxplots, each representing a different dataset. The boxplots look identical even though the underlying distributions may not be.
ggplot(df_long) +
geom_boxplot(aes(y=y,x=dataset))+
medtheme()
{: .language-r}
The code for the above plot:
ggplot(df_long) +
geom_boxplot(aes(y=y,x=dataset))
{: .language-r}
Violin plots are an alternative to boxplots. They are based on an estimation of the underlying probability density, i.e. they use more information inherent in the data set. Have a look at the following three violin plots of the same datasets as above. Again, two of the violin plots look similar but the underlying data may not be identical.
ggplot(df_long) +
geom_violin(aes(y=y,x=dataset))+
medtheme()
{: .language-r}
Let’s finally have a look at the actual data. As you can see the samples x1 and x3 are in fact very distinct, or more precisely, x3 seems to have only 5 possible values.
ggplot(df_long) +
geom_point(aes(y=y,x=dataset)) +
medtheme()
{: .language-r}
So why did the boxplot not show the distributional differences? Since boxplots only show certain quantiles (usually the quartiles, i.e., 25%, 50% and 75%, plus “outliers”) plots of different datasets having the same or similar quantiles appear identical. The quartiles of the three data sets are
df_long %>%
dplyr::group_by(dataset) %>%
dplyr::summarise(q25 = quantile(y, probs=0.25),
q50 = quantile(y, probs=0.5),
q75 = quantile(y, probs=0.75)) %>%
knitr::kable()
{: .language-r}
| dataset | q25 | q50 | q75 |
|---|---|---|---|
| x1 | -2.080552 | 0 | 2.080552 |
| x2 | -2.044706 | 0 | 2.044706 |
| x3 | -2.050000 | 0 | 2.050000 |
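The same kind of summary can be computed in base R. The following is a minimal sketch that rebuilds only the samples `x1` and `x3` with the generating code from above and uses `split` and `sapply` to apply `quantile` per dataset; the object names `y`, `dataset` and `quartiles` are chosen here for illustration.

```r
# Rebuild two of the samples with the generating code used above
set.seed(123)
x1 <- rnorm(40, 2)
xnorm <- c(x1, -x1)
x3 <- c(rep(0, 4), rep(2.05, 13), rep(-2.05, 13), rep(3.8, 5), rep(-3.8, 5))

# Long format in two plain vectors: values and group labels
y <- c(xnorm, x3)
dataset <- rep(c("x1", "x3"), times = c(length(xnorm), length(x3)))

# split() groups y by dataset, sapply() applies quantile() to each group;
# t() puts one dataset per row
quartiles <- t(sapply(split(y, dataset), quantile, probs = c(0.25, 0.5, 0.75)))
quartiles
```

Because both samples are symmetric around zero by construction, their medians are exactly zero, and `x3` only takes five distinct values, so its quartiles fall on -2.05, 0 and 2.05.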
Violin plots show a mirrored estimation of the underlying density using a smoothing technique. Loosely speaking, smoothing means that starting from a histogram a smooth version of the underlying probability distribution is created. The degree of smoothing, ranging in this case from histogram to straight line, determines the actual appearance of the plot. For the violin plot the degree of smoothing is chosen automatically. Histograms with 5 bins for the data x1 and x3 would already be very similar, and hence so are their smoothed versions.
Caution is furthermore advised if the datasets that are compared are of very different size, because often more data gives you a higher confidence in the observed distribution. It is therefore advised to initially always have a look at the actual data and not just the summaries (like boxplots and violin plots) to detect anomalies.
Another option is the use of geom_jitter (or
geom_sina from the ggforce package) in
combination with violin plots:
set.seed(123)
ggplot(df_long) +
geom_violin(aes(y=y,x=dataset)) +
geom_jitter(aes(y=y,x=dataset),width=0.3) +
medtheme()
{: .language-r}
The code for the above plot:
ggplot(df_long) +
geom_violin(aes(y=y,x=dataset)) +
geom_jitter(aes(y=y,x=dataset),width=0.3)
{: .language-r}
The advantage is that individual points as well as the distribution are shown.
Boxplots together with geom_jitter are another
possibility.
set.seed(123)
ggplot(df_long) +
geom_boxplot(aes(y=y,x=dataset)) +
geom_jitter(aes(y=y,x=dataset),width=0.3) +
medtheme()
{: .language-r}
Another possibility is to only show the jittered data:
set.seed(123)
ggplot(df_long) +
geom_jitter(aes(y=y,x=dataset),width=0.3) +
medtheme()
{: .language-r}
The same as discussed before for boxplots also holds for barplots. If you have continuous data and see the following barplots you might conclude that the data sets are the same:
df_long <- df_long %>%
dplyr::group_by(dataset) %>%
dplyr::mutate(y_t =y-min(y),
y_mean=mean(y_t),
y_sd=sd(y_t),
y_sd_min=y_mean-y_sd,
y_sd_max=y_mean+y_sd)
ggplot() +
geom_col(aes(y=y_mean,x=dataset),data = unique(df_long[,c("dataset","y_mean")])) +
geom_errorbar(aes(x=dataset,ymin=y_sd_min,ymax=y_sd_max),data = unique(df_long[,c("dataset","y_sd_min","y_sd_max")]),width=0.2)+
labs(y="y") +
medtheme()
{: .language-r}
But if you also show the individual points you can see clear differences:
set.seed(123)
df_long <- df_long %>%
dplyr::group_by(dataset) %>%
dplyr::mutate(y_t =y-min(y),
y_mean=mean(y_t),
y_sd=sd(y_t),
y_sd_min=y_mean-y_sd,
y_sd_max=y_mean+y_sd)
ggplot() +
geom_col(aes(y=y_mean,x=dataset),data = unique(df_long[,c("dataset","y_mean")])) +
geom_errorbar(aes(x=dataset,ymin=y_sd_min,ymax=y_sd_max),data = unique(df_long[,c("dataset","y_sd_min","y_sd_max")]),width=0.2)+
geom_jitter(aes(y=y_t,x=dataset), data=df_long, width = 0.3)+
labs(y="y") +
medtheme()
{: .language-r}
When using barplots with error bars it is important to state what the error bars mean. Do they correspond to the standard deviation, the standard error or a confidence interval? There is no clear answer to which one to use and, if possible, other types of visualizations should be used.
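To see how much the choice matters, the three kinds of error bars can be computed for the same sample. This is a minimal sketch with a made-up sample `y`; the confidence interval assumes approximately normal data and uses the t distribution.

```r
# Hypothetical sample of 20 measurements
set.seed(42)
y <- rnorm(20, mean = 10, sd = 2)
n <- length(y)

y_mean <- mean(y)
y_sd <- sd(y)                          # spread of the individual observations
y_se <- y_sd / sqrt(n)                 # uncertainty of the estimated mean
y_ci <- qt(0.975, df = n - 1) * y_se   # half-width of a 95% confidence interval

bars <- data.frame(
  type  = c("mean +/- SD", "mean +/- SE", "95% CI"),
  lower = y_mean - c(y_sd, y_se, y_ci),
  upper = y_mean + c(y_sd, y_se, y_ci)
)
bars
```

For a sample of this size the standard error bars are the shortest and the standard deviation bars the longest, so the same data can look precise or noisy depending on an unstated choice.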
The axes of a plot determine how much information you provide and where you put the focus. You can cut axes, blow up certain parts of an axis through transformation, or hide information on certain scales by not transforming. You can expose or hide information by choosing the aspect ratio between the x and y axes. You can provide clear and precise information through meaningful labeling of axes and axis tick marks, or you can obscure the same information by deliberately choosing uninformative tick locations, for example. These issues are illustrated with examples in the following sections.
set.seed(123)
df <- data.frame(x=factor(c(rbinom(10,2,0.5),rep(0:2,500))))
{: .language-r}
Let’s consider the following two barplots. The first has a shortened axis range and shows clear differences between the groups. The second plot, on the other hand, shows the entire axis starting from zero and the differences disappear.
# geom_bar() counts the observations per category
ggplot(df) +
geom_bar(aes(x)) +
coord_cartesian(ylim=c(500,max(table(df$x)))) +
medtheme()
ggplot(df) +
geom_bar(aes(x))+
medtheme()
{: .language-r}
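The size of the distortion can be put into numbers by comparing the ratio of the bar heights as drawn with the ratio of the underlying counts. The counts and the baseline below are hypothetical values chosen for illustration, not the data plotted above.

```r
# Hypothetical counts for three groups
counts <- c(a = 502, b = 505, c = 503)
baseline <- 500   # value at which the truncated y-axis starts

# Ratio of the drawn bar heights versus ratio of the actual counts
visible_ratio <- (counts["b"] - baseline) / (counts["a"] - baseline)
true_ratio <- counts["b"] / counts["a"]

unname(visible_ratio)  # 2.5
unname(true_ratio)     # about 1.006
```

On the truncated axis bar b looks 2.5 times as tall as bar a, although the counts differ by less than one percent.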
Here is a flashy, concrete example of cutting an axis, which makes differences appear much larger than they are in reality:
The other way around is also possible. Choosing to show the entire axis starting from zero can mask differences that do matter. For instance, the following graph gives “1000% proof for stable temperature and the climate change being a HOAX”, simply by showing an axis starting at zero kelvin and thus making differences seem negligibly small:
set.seed(123)
x1 <- runif(100,2,4)
df <- data.frame(x=x1, y=rnorm(100,exp(x1)+10,exp(x1)/3))
df <- data.frame(x=x1, y=rnorm(100,exp(x1+10),10))
{: .language-r}
In some cases you might have data on completely different scales, meaning that there are differences to be seen at different orders of magnitude. In these cases an axis transformation can often be helpful. For instance, consider the following untransformed plot:
x1 <- c(runif(100,0.8,1.2),runif(100,8,12),runif(100,80,120),runif(100,800,1200),runif(100,8000,12000))
# draw one y per x (500 values) so that the spread grows with the magnitude of x
df <- data.frame(x=x1, y=rnorm(500,0,rep(c(1,10,100,1000,10000),each=100)))
ggplot(df) +
geom_point(aes(x,y))+
medtheme()
{: .language-r}
There seems to be some structure, but especially for the low values it
is not clear what is going on. If you instead apply a log10
transformation to the x-axis, things get much clearer. Axis
transformations are also worth considering if you have, for example,
nonlinear scales. But beware, transformations can also be used to
showcase differences that do not really matter in practice.
ggplot(df) +
geom_point(aes(x,y)) +
scale_x_continuous(trans="log10")+
medtheme()
{: .language-r}
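For reference, here is a sketch of two other ways to get a logarithmic x-axis: the ggplot2 shorthand `scale_x_log10()` and the `log` argument of base R's `plot()`. The data frame `log_df` is made up for this example.

```r
library(ggplot2)

# Hypothetical data spanning several orders of magnitude on x
set.seed(1)
log_df <- data.frame(x = 10^runif(300, 0, 4))
log_df$y <- rnorm(300, sd = log_df$x)   # spread grows with x

# ggplot2: scale_x_log10() applies the log10 transformation to the x scale
p_log <- ggplot(log_df) +
  geom_point(aes(x, y)) +
  scale_x_log10()
p_log

# Base R: the log argument of plot() selects which axes are logarithmic
plot(log_df$x, log_df$y, log = "x", xlab = "x", ylab = "y")
```

A log scale only works for strictly positive values, which is why the example transforms x and leaves y, which takes negative values, untransformed.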
set.seed(123)
df <- data.frame(x=rnorm(100),
y=rnorm(100))
{: .language-r}
The aspect ratio is another important parameter that can be manipulated to overstress certain patterns. For example, have a look at the following two plots. The first has a ratio of one, meaning that one unit on the x-axis is as long as one unit on the y-axis. The second plot has an aspect ratio of 1/4, meaning that one unit on the x-axis is four times as long as one unit on the y-axis.
ggplot(df) +
geom_point(aes(x,y)) +
coord_fixed(ratio=1)+
medtheme()
ggplot(df) +
geom_point(aes(x,y)) +
coord_fixed(ratio=1/4)+
medtheme()
{: .language-r}
Code for the above plot:
ggplot(df) +
geom_point(aes(x,y)) +
coord_fixed(ratio=1)
ggplot(df) +
geom_point(aes(x,y)) +
coord_fixed(ratio=1/4)
{: .language-r}
Visually the second plot implies that the variance of x is much higher than of y, which is not the case.
summarise(df, x = var(x), y = var(y))
{: .language-r}
x y
1 0.8332328 0.9350631
{: .output}
Also consider the following real example:
Where does the increase look the most dramatic?
The appearance of a histogram is determined by the bin width that is used to create it. If you have a very large bin width (or a low total number of bins) you might see something like this and would probably consider the data to be approximately uniformly distributed.
set.seed(123)
df <- data.frame(x=unlist(lapply(0:9, function(i) c(rep(0.501+i,sample(1:3,1)),rep(1+i,sample(17:20,1))))))
ggplot(df) +
geom_histogram(aes(x), binwidth = 1)+
medtheme()
{: .language-r}
If on the other hand you decrease the binwidth (or increase the number of bins) you might see something like this:
ggplot(df) +
geom_histogram(aes(x), binwidth = 0.5)+
medtheme()
{: .language-r}
This makes it quite obvious that the data are most definitely not uniformly distributed (on this scale). Choosing the correct bin width is not easy and depends largely on the context.
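The effect of the bin width can also be checked numerically in base R with `hist(..., plot = FALSE)`, which returns the bin counts without drawing. The vector `x` below is a made-up sample with a similar structure to the one above, and the `nclass.*` functions are common rules of thumb, not a definitive answer.

```r
# Hypothetical data: few values at i + 0.25 and many at i + 0.75
x <- c(rep(0:9 + 0.25, each = 2), rep(0:9 + 0.75, each = 18))

# Same data, two bin widths
wide <- hist(x, breaks = seq(0, 10, by = 1), plot = FALSE)
narrow <- hist(x, breaks = seq(0, 10, by = 0.5), plot = FALSE)

wide$counts    # 20 in every bin: looks uniform
narrow$counts  # alternates between 2 and 18: clearly not uniform

# Rules of thumb for the number of bins; starting points only
c(Sturges = nclass.Sturges(x), Scott = nclass.scott(x), FD = nclass.FD(x))
```

The rules can disagree with each other, which is one more reason to try several bin widths and to look at the raw data.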
With geom_rug you can mark the position of individual
observations:
ggplot(df,aes(x)) +
geom_histogram(binwidth = 0.5) +
geom_rug()+
medtheme()
{: .language-r}
Code for the above plot:
ggplot(df,aes(x)) +
geom_histogram(binwidth = 0.5) +
geom_rug()
{: .language-r}
If you provide plots in multiple panels, each using the same variables, you need to pay attention to the scale of each subplot. For example have a look at the following plot.
set.seed(123)
df <- data.frame(y=c(rnorm(50),rnorm(50,3),rnorm(50,6)), x=c(rnorm(150)), sample=rep(1:3, each=50))
ggplot(df) +
geom_point(aes(x,y)) +
facet_wrap(~sample, scales = "free")+
medtheme()
{: .language-r}
At first glance the distribution of each of the three samples looks the same. But if you look closely you can see that the scales are not the same for each subplot. If you instead keep the scale the same across subplots you get a visualization with clear differences of the distributions between the different samples.
ggplot(df) +
geom_point(aes(x,y)) +
facet_wrap(~sample)+
medtheme()
{: .language-r}
Code for the above plot:
ggplot(df) +
geom_point(aes(x,y)) +
facet_wrap(~sample)
{: .language-r}
Another example of using different scales:
Trying to encode more than 8 categories with colors is usually not a good idea, as distinguishing between the colors can become very difficult:
mtcars %>%
rownames_to_column() %>%
ggplot() +
geom_point(aes(mpg,disp, color=factor(rowname))) +
labs(color="") +
medtheme()
{: .language-r}
In such a case it can be a better idea to directly label the points:
mtcars %>%
rownames_to_column() %>%
ggplot() +
geom_point(aes(mpg,disp,color=cyl)) +
ggrepel::geom_label_repel(aes(mpg,disp, label=rowname),
size = 2.5,label.size = 0.1,
label.padding = 0.1)+
medtheme()
{: .language-r}
Code for the above plot:
mtcars %>%
rownames_to_column() %>%
ggplot() +
geom_point(aes(mpg,disp,color=cyl)) +
ggrepel::geom_label_repel(aes(mpg,disp, label=rowname),
size = 2.5,label.size = 0.1,
label.padding = 0.1)
{: .language-r}
See also: Common pitfalls of color use in Fundamentals of Data Visualization.
About 1 of every 12 people is affected by some type of color vision deficiency. This is important to keep in mind when choosing colors for visualizations. For example, consider the following scatter plot using a Red-Yellow-Green color palette, knowing that red-green color blindness is the most frequent type of color vision deficiency.
ggplot(mtcars) +
geom_point(aes(mpg,disp, color=factor(carb))) +
scale_colour_brewer(palette="RdYlGn")+
labs(color="carb") +
medtheme()
{: .language-r}
To check how a plot appears to people with a color vision deficiency you can use
the cvd_grid function from the colorblindr
package (install instructions in the colorblindr GitHub
repo).
colorblindr::cvd_grid()+
medtheme()
{: .language-r}

Using a different color palette can help. For example the following:
ggplot(mtcars) +
geom_point(aes(mpg,disp, color=factor(carb))) +
scale_color_OkabeIto()+
labs(color="carb") +
medtheme()
{: .language-r}

Code for the above plot:
ggplot(mtcars) +
geom_point(aes(mpg,disp, color=factor(carb))) +
scale_color_OkabeIto()
{: .language-r}
Another option is the dichromat package which features
multiple palettes for people with red-green colorblindness.
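If installing an extra package is not an option, the Okabe-Ito colors are also available in base R through `palette.colors()` (grDevices, R 4.0.0 or later). The sketch below passes them to `scale_color_manual()`; the object names are chosen for illustration.

```r
library(ggplot2)

# One color per level of carb, taken from the Okabe-Ito palette in base R
n_levels <- nlevels(factor(mtcars$carb))
okabe_ito <- unname(palette.colors(n = n_levels, palette = "Okabe-Ito"))

p_oi <- ggplot(mtcars) +
  geom_point(aes(mpg, disp, color = factor(carb))) +
  scale_color_manual(values = okabe_ito) +
  labs(color = "carb")
p_oi
```

The palette has nine colors in total, so it also enforces the limit on the number of categories discussed earlier.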
Since the number of possible visualizations can be quite overwhelming, there are guidelines for choosing a suitable plot type that can be used as a starting point. For example the following decision tree:
But keep in mind that this is not a strict rule; depending on the situation, different options can be preferable.
We additionally provide a quick introduction to the widely used
package ggplot2. It is based on the idea of a grammar of
graphics, in other words well-defined instructions for how to create a
plot.
We will not go further into the theoretical details but instead jump
right into the practical part. Before creating our first ggplot we load
ggplot2 and have a look at the mtcars dataset
that we will use.
library(ggplot2)
head(mtcars)
{: .language-r}
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
{: .output}
Running the following command will initialize a ggplot object without yet showing anything since we don’t tell it what to show.
ggplot(mtcars)
{: .language-r}
To create a plot we have to specify what kind of plot - or geom - we
want to use. In this case we want a scatter plot so we choose
geom_point. For other possibilities check the
ggplot2 documentation (e.g. ?ggplot2) or just
do a Google search. Furthermore we have to specify which columns in our
dataset we want to use for which axis, in other words assign columns
of the data to the x and y aesthetics (aes). To check which
aesthetics are available for which geom, check the Aesthetics
section in the documentation of the respective geom (e.g.
?geom_point).
ggplot(mtcars) +
geom_point(aes(x=mpg, y=disp))
{: .language-r}
If we want to show mpg on the y-axis and
disp on the x-axis:
ggplot(mtcars) +
geom_point(aes(x=disp, y=mpg))
{: .language-r}
If we want a boxplot we could do
ggplot(mtcars) +
geom_boxplot(aes(x=mpg))
{: .language-r}
or:
ggplot(mtcars) +
geom_boxplot(aes(y=mpg))
{: .language-r}
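For comparison, a rough base R counterpart of the scatter plot and the boxplot is sketched below. It uses only `plot()` and `boxplot()` from the graphics package; the variable name `bp` is chosen for illustration.

```r
# Scatter plot: plot(x, y)
plot(mtcars$mpg, mtcars$disp, xlab = "mpg", ylab = "disp")

# Boxplot; horizontal = TRUE corresponds to aes(x = mpg)
bp <- boxplot(mtcars$mpg, horizontal = TRUE, xlab = "mpg")

# boxplot() invisibly returns the numbers it draws:
# lower whisker, lower hinge, median, upper hinge, upper whisker
bp$stats
```

In base R the data are passed as vectors and each plot type has its own function, whereas ggplot2 keeps the data frame and swaps the geom.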
To color the plot according to another column in the data the
color (or colour) aesthetic can be used.
ggplot(mtcars) +
geom_point(aes(x=mpg, y=disp, color=factor(cyl)))
{: .language-r}
Many options to change the appearance of plots are available. For example, change labels:
ggplot(mtcars) +
geom_point(aes(x=mpg, y=disp, color=factor(cyl))) +
labs(x="X - Axis", title = "ggplot2", color="")
{: .language-r}
Or change the theme (for all options check out the help page of
theme), either by using a predefined theme:
# preconfigured
ggplot(mtcars) +
geom_point(aes(x=mpg, y=disp, color=factor(cyl))) +
theme_bw()
{: .language-r}
Or by creating your own. (Can you figure out what each of the arguments
in theme does?)
# change settings yourself
ggplot(mtcars) +
geom_point(aes(x=mpg, y=disp, color=factor(cyl))) +
theme(panel.background = element_rect(fill = "yellow", color="red"),
plot.background = element_rect(fill="blue"),
legend.background = element_rect(fill="red"),
axis.title = element_text(colour = "white"),
axis.line = element_line(linetype = 7,colour = "black",linewidth=3),
panel.grid = element_line(colour="grey",linewidth = 1,linetype = 2),
axis.text = element_text(angle = 45,hjust=1,colour = "lightgrey",size = 14))
{: .language-r}
Changing the color palette used can be done by using either a custom
palette (in this case generated using the RColorBrewer
package)
ggplot(mtcars) +
geom_point(aes(x=mpg, y=disp, color=factor(cyl))) +
scale_color_manual(values=RColorBrewer::brewer.pal(3, "Set2"))
{: .language-r}
or by directly using an existing scale (usually of the form
scale_color_NAME). Further options will be given below.
ggplot(mtcars) +
geom_point(aes(x=mpg, y=disp, color=factor(cyl))) +
scale_color_viridis_d()
{: .language-r}
How do you create a scatter plot with ggplot2?
- ggplot(df) + geom_point(aes(x,y)) (x)
- ggplot(df, aes(x,y)) + geom_point(aes(x,y)) (x)
- ggplot(df, aes(x,y)) + geom_point() (x)
- ggplot(df) + geom_point()
- geom_point(df, aes(x,y))

Color by group?
- ggplot(df) + geom_point(aes(x,y, color=group)) (x)
- ggplot(df) + geom_point(aes(x,y, colour=group)) (x)
- ggplot(df, aes(color=group)) + geom_point(aes(x,y)) (x)
- ggplot(df, aes(colour=group)) + geom_point(aes(x,y)) (x)
- ggplot(df) + geom_point(aes(x,y)) + geom_col(aes(group))
- ggplot(df) + geom_point(aes(x,y)) + theme(group.color=aes(group))
For this homework we will work with climate data published by the Bundesamt für Statistik (BFS), in which various climate-related variables measured at different locations in Switzerland have been put together. The data has already been wrangled into a csv file that you can download from here.
The source data was downloaded from here: https://www.bfs.admin.ch/asset/de/je-d-02.03.03.02 and here: https://www.bfs.admin.ch/asset/de/je-d-02.03.03.03
You will upload a pdf file produced from R Markdown containing the answers to the questions below in the next step (allowed file format is .pdf). Make sure the code to produce the answers is shown. Your submission will be peer reviewed. Staff will perform random checks on the peer review.
In this first task read in the climate_data.csv file and
do a short exploration of the dataset.
climatedf_comp <- read.csv(here::here("../files/docs/07/climate_data.csv"))
{: .language-r}
Show the top 3 rows of the dataset and additionally a short summary
of the dataset (Hint: use summary). Describe what
you observe in a few words.
The goal is to visualize the association of
Annual_temperature and Year. To increase the
visibility we will only look at the locations
ZürichFluntern, Säntis, Samedan,
LocarnoMonti.
Choose a suitable visualization (maybe consider looking at the decision tree) and plot the respective graph.
Based on the previous plot update / change your plot to also include the information about the altitude. Make sure that the location information is also provided.
In the next step we want to normalize the annual temperature by using
the values of the years <1951 as a baseline, i.e. calculate the mean
Annual_temperature for Year<1951 for each
Location and subtract this value from
Annual_temperature. Present a visualization that allows you to
study the deviation from this baseline mean by location.
The next goal is to explore associations between
Annual_Precipitation and Sunshine_duration
for the locations
ZürichFluntern, Säntis, Samedan, LocarnoMonti.
Present at least two different types of plots.
We have already briefly looked at facets, which make it easy to
arrange multiple plots. But so far we have only considered the case
where each subplot shows the same variables, e.g.
Sunshine_duration vs. Annual_frost_days. What
if instead you would like to use facets to plot multiple variables? For
instance, you would like a plot containing two subplots, the first
Annual_frost_days vs. Sunshine_duration and
the second Annual_summer_days
vs. Sunshine_duration?
There are basically two options:
- create the plots separately and combine them afterwards
- bring the data into long format and use facets
We will explore both options in the following.
There are many ways to combine plots. Two useful
packages are cowplot (for all graphics) and
ggpubr (for ggplots). In this exercise we will use
ggpubr.
Create two ggplot2 scatterplots,
Annual_frost_days vs. Sunshine_duration and
Annual_summer_days vs. Sunshine_duration,
color by location. Combine the two plots using
ggpubr::ggarrange and make sure to have only one legend.
Also make sure to have the same axis range in both plots.
The second option is to use facets
(e.g. ggplot2::facet_wrap). Since our data is currently not
in the correct format we first have to bring it into shape. This can be
done using tidyr::pivot_longer, which transforms data from
wide to long format. In the wide format each row holds multiple
values, while in the long format each row holds only a single value and
the remaining columns identify the sample. You can learn more
about pivoting and the long and wide formats by running
vignette("pivot",package = "tidyr") in the console.
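To make the difference between the two formats concrete, here is a minimal sketch on a toy table. The data frame `wide` and its columns `site`, `rain` and `sun` are hypothetical and unrelated to the climate data.

```r
library(tidyr)

# Hypothetical wide data: one row per site, one column per measured variable
wide <- data.frame(
  site = c("A", "B"),
  rain = c(1200, 800),
  sun  = c(1500, 2100)
)

# Long format: one row per site and variable, with a single value column
long <- pivot_longer(wide,
                     cols = c(rain, sun),
                     names_to = "variable",
                     values_to = "value")
long
```

The new `variable` column records which original column each value came from, which is what a faceting formula such as `~variable` needs in order to draw one panel per variable.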
Use tidyr::pivot_longer to bring the data into long
format and plot Annual_frost_days
vs. Sunshine_duration and Annual_summer_days
vs. Sunshine_duration in the same plot using
ggplot2::facet_wrap.
Hint: The columns to pivot into longer format are
Annual_frost_days and Annual_summer_days.
In some situations where labels on the x-axis are long they can overlap with the default setting:
Warning: Removed 129 rows containing non-finite values (`stat_ydensity()`).
{: .warning}
A solution can be to rotate the labels:
Warning: Removed 129 rows containing non-finite values (`stat_ydensity()`).
{: .warning}
Reproduce the above plot.
Hint: use the argument axis.text.x in the
theme function and make sure to check the expected input
class in axis.text.x.
You can generate custom colors using
RColorBrewer::brewer.pal. The generated colors can then be
used in combination with
scale_color_manual(values=generated_colors).
Upload a pdf containing answers to all tasks in the next step. Make sure the code to produce the answers is shown.